Cramér–Rao Bound

In estimation theory and statistics, the Cramér–Rao bound (CRB) expresses a lower bound on the variance of unbiased estimators of a deterministic (fixed, though unknown) parameter: the variance of any such estimator is at least as high as the inverse of the Fisher information. Equivalently, it expresses an upper bound on the precision (the inverse of variance) of unbiased estimators: the precision of any such estimator is at most the Fisher information. The result is named in honor of Harald Cramér and C. R. Rao, but it was also derived independently by Maurice Fréchet, Georges Darmois, and by Alexander Aitken and Harold Silverstone. An unbiased estimator that achieves this lower bound is said to be (fully) ''efficient''. Such a solution achieves the lowest possible mean squared error among all unbiased methods, and is therefore the minimum variance unbiased (MVU) estimator. However, in some cases, no unbiased technique exists which achieves the bound. This may occur either if for any unbiased estimator there exists another with a strictly smaller variance, or if an MVU estimator exists but its variance is strictly greater than the inverse of the Fisher information. The Cramér–Rao bound can also be used to bound the variance of estimators with a given bias. In some cases, a biased approach can result in both a variance and a mean squared error that are below the unbiased Cramér–Rao lower bound; see bias of an estimator.


Statement

The Cramér–Rao bound is stated in this section for several increasingly general cases, beginning with the case in which the parameter is a scalar and its estimator is unbiased. All versions of the bound require certain regularity conditions, which hold for most well-behaved distributions. These conditions are listed later in this section.


Scalar unbiased case

Suppose \theta is an unknown deterministic parameter that is to be estimated from n independent observations (measurements) of x, each from a distribution according to some probability density function f(x;\theta). The variance of any ''unbiased'' estimator \hat{\theta} of \theta is then bounded by the reciprocal of the Fisher information I(\theta):

:\operatorname{var}(\hat{\theta}) \geq \frac{1}{I(\theta)}

where the Fisher information I(\theta) is defined by

:I(\theta) = n \operatorname{E}_\theta\!\left[ \left( \frac{\partial \ell(x;\theta)}{\partial\theta} \right)^{2} \right]

and \ell(x;\theta)=\log f(x;\theta) is the natural logarithm of the likelihood function for a single sample x, while \operatorname{E}_\theta denotes the expected value with respect to the density f(x;\theta) of X. If \ell(x;\theta) is twice differentiable and certain regularity conditions hold, then the Fisher information can also be defined as

:I(\theta) = -n \operatorname{E}_\theta\!\left[ \frac{\partial^{2} \ell(x;\theta)}{\partial\theta^{2}} \right].

The efficiency of an unbiased estimator \hat{\theta} measures how close this estimator's variance comes to this lower bound; estimator efficiency is defined as

:e(\hat{\theta}) = \frac{I(\theta)^{-1}}{\operatorname{var}(\hat{\theta})},

i.e. the minimum possible variance for an unbiased estimator divided by its actual variance. The Cramér–Rao lower bound thus gives

:e(\hat{\theta}) \le 1.
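The following is a minimal numerical sketch (not part of the original article; it assumes a normal model with known standard deviation) of the scalar bound: the sample mean of n i.i.d. N(\theta,\sigma^2) observations is unbiased, and its variance matches the bound \sigma^2/n.

```python
# Illustrative sketch: Monte Carlo check of the scalar Cramér–Rao bound for the
# mean of a normal distribution with known variance. For X ~ N(theta, sigma^2),
# the per-sample Fisher information is 1/sigma^2, so with n samples
# I(theta) = n/sigma^2 and the bound is sigma^2/n.
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n, trials = 2.0, 1.5, 50, 20000

# Empirical variance of the unbiased estimator theta_hat = sample mean.
samples = rng.normal(theta, sigma, size=(trials, n))
theta_hat = samples.mean(axis=1)
empirical_var = theta_hat.var()

crb = sigma**2 / n          # 1 / I(theta)
print(f"empirical var = {empirical_var:.5f}, CRB = {crb:.5f}")
# The two numbers agree closely: the sample mean is an efficient estimator.
```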


General scalar case

A more general form of the bound can be obtained by considering a biased estimator T(X), whose expectation is not \theta but a function of this parameter, say, \psi(\theta). Hence \operatorname{E}\{T(X)\} - \theta = \psi(\theta) - \theta is not generally equal to 0. In this case, the bound is given by

:\operatorname{var}(T) \geq \frac{[\psi'(\theta)]^{2}}{I(\theta)}

where \psi'(\theta) is the derivative of \psi(\theta) with respect to \theta, and I(\theta) is the Fisher information defined above.
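As a simple illustration (not from the original text; it uses only standard facts about normal moments), suppose X_1,\ldots,X_n \sim N(\theta,\sigma^2) with known \sigma^2, and take T = \bar{X}^2 as an estimator of \theta^2. Then

:\psi(\theta) = \operatorname{E}(\bar{X}^2) = \theta^2 + \frac{\sigma^2}{n}, \qquad \psi'(\theta) = 2\theta, \qquad I(\theta) = \frac{n}{\sigma^2},

so the bound reads

:\operatorname{var}(T) \geq \frac{[\psi'(\theta)]^{2}}{I(\theta)} = \frac{4\theta^{2}\sigma^{2}}{n},

while the actual variance, computed from the moments of \bar{X} \sim N(\theta, \sigma^2/n), is

:\operatorname{var}(\bar{X}^2) = \frac{4\theta^{2}\sigma^{2}}{n} + \frac{2\sigma^{4}}{n^{2}},

which indeed exceeds the bound and approaches it as n grows.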


Bound on the variance of biased estimators

Apart from being a bound on estimators of functions of the parameter, this approach can be used to derive a bound on the variance of biased estimators with a given bias, as follows. Consider an estimator \hat{\theta} with bias b(\theta) = \operatorname{E}\{\hat{\theta}\} - \theta, and let \psi(\theta) = b(\theta) + \theta. By the result above, any unbiased estimator whose expectation is \psi(\theta) has variance greater than or equal to [\psi'(\theta)]^{2}/I(\theta). Thus, any estimator \hat{\theta} whose bias is given by a function b(\theta) satisfies

:\operatorname{var}\left(\hat{\theta}\right) \geq \frac{[1+b'(\theta)]^{2}}{I(\theta)}.

The unbiased version of the bound is a special case of this result, with b(\theta)=0. It is trivial to have a small variance: an "estimator" that is constant has a variance of zero. But from the above equation we find that the mean squared error of a biased estimator is bounded by

:\operatorname{E}\left((\hat{\theta}-\theta)^{2}\right) \geq \frac{[1+b'(\theta)]^{2}}{I(\theta)} + b(\theta)^{2},

using the standard decomposition of the MSE into variance plus squared bias. Note, however, that if 1+b'(\theta)<1 this bound might be less than the unbiased Cramér–Rao bound 1/I(\theta). For instance, in the example of estimating variance below, 1+b'(\theta) = \frac{n}{n+2} < 1.
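A quick numerical check of this biased-estimator bound is sketched below. It is not from the original article; it assumes the normal-variance example treated later in this article, with parameter \theta = \sigma^2, known mean \mu, divisor n+2, bias b(\theta) = -2\theta/(n+2), and Fisher information I(\theta) = n/(2\theta^2).

```python
# Illustrative sketch: check MSE >= (1 + b'(theta))^2 / I(theta) + b(theta)^2
# for the biased variance estimator T = sum((x - mu)^2) / (n + 2) of a
# N(mu, theta) sample, where theta = sigma^2, I(theta) = n / (2 theta^2), and
# b(theta) = -2 theta / (n + 2), so 1 + b'(theta) = n / (n + 2).
import numpy as np

rng = np.random.default_rng(1)
mu, theta, n, trials = 0.0, 2.0, 10, 200000   # theta is the true variance

x = rng.normal(mu, np.sqrt(theta), size=(trials, n))
T = ((x - mu) ** 2).sum(axis=1) / (n + 2)
empirical_mse = ((T - theta) ** 2).mean()

I = n / (2 * theta**2)                         # Fisher information for theta
b = -2 * theta / (n + 2)                       # bias of T
bound = (n / (n + 2)) ** 2 / I + b**2          # biased Cramér–Rao bound on MSE
print(f"empirical MSE = {empirical_mse:.5f}, bound = {bound:.5f}")
# Both are approximately 2*theta^2/(n+2): this biased estimator attains its bound.
```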


Multivariate case

Extending the Cramér–Rao bound to multiple parameters, define a parameter column vector

:\boldsymbol{\theta} = \left[ \theta_1, \theta_2, \dots, \theta_d \right]^{T} \in \mathbb{R}^{d}

with probability density function f(x; \boldsymbol{\theta}) which satisfies the two regularity conditions below. The Fisher information matrix is a d \times d matrix with element I_{m,k} defined as

:I_{m,k} = \operatorname{E}\left[ \frac{\partial}{\partial\theta_m} \log f\left(x; \boldsymbol{\theta}\right) \, \frac{\partial}{\partial\theta_k} \log f\left(x; \boldsymbol{\theta}\right) \right] = -\operatorname{E}\left[ \frac{\partial^{2}}{\partial\theta_m \, \partial\theta_k} \log f\left(x; \boldsymbol{\theta}\right) \right].

Let \boldsymbol{T}(X) be an estimator of any vector function of parameters, \boldsymbol{T}(X) = (T_1(X), \ldots, T_d(X))^{T}, and denote its expectation vector \operatorname{E}[\boldsymbol{T}(X)] by \boldsymbol{\psi}(\boldsymbol{\theta}). The Cramér–Rao bound then states that the covariance matrix of \boldsymbol{T}(X) satisfies

:I\left(\boldsymbol{\theta}\right) \geq \phi(\theta)^{T} \operatorname{cov}_{\boldsymbol{\theta}}\left(\boldsymbol{T}(X)\right)^{-1} \phi(\theta),
:\operatorname{cov}_{\boldsymbol{\theta}}\left(\boldsymbol{T}(X)\right) \geq \phi(\theta) \, I\left(\boldsymbol{\theta}\right)^{-1} \phi(\theta)^{T}

where
* the matrix inequality A \ge B is understood to mean that the matrix A-B is positive semidefinite, and
* \phi(\theta) := \partial \boldsymbol{\psi}(\boldsymbol{\theta})/\partial \boldsymbol{\theta} is the Jacobian matrix whose ij element is given by \partial \psi_i(\boldsymbol{\theta})/\partial \theta_j.

If \boldsymbol{T}(X) is an unbiased estimator of \boldsymbol{\theta} (i.e., \boldsymbol{\psi}\left(\boldsymbol{\theta}\right) = \boldsymbol{\theta}), then the Cramér–Rao bound reduces to

:\operatorname{cov}_{\boldsymbol{\theta}}\left(\boldsymbol{T}(X)\right) \geq I\left(\boldsymbol{\theta}\right)^{-1}.

If it is inconvenient to compute the inverse of the Fisher information matrix, then one can simply take the reciprocal of the corresponding diagonal element to find a (possibly loose) lower bound:

:\operatorname{var}_{\boldsymbol{\theta}}(T_m(X)) = \left[\operatorname{cov}_{\boldsymbol{\theta}}\left(\boldsymbol{T}(X)\right)\right]_{mm} \geq \left[I\left(\boldsymbol{\theta}\right)^{-1}\right]_{mm} \geq \left(\left[I\left(\boldsymbol{\theta}\right)\right]_{mm}\right)^{-1}.
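The sketch below (an illustration assumed here, not from the original article) checks the matrix form of the bound for a normal model with both mean and variance unknown, where the Fisher information matrix from n i.i.d. samples is \operatorname{diag}(n/\sigma^2,\; n/(2\sigma^4)).

```python
# Illustrative sketch: matrix Cramér–Rao bound for theta = (mu, sigma^2) of a
# N(mu, sigma^2) model. The covariance of an unbiased estimator pair
# (sample mean, unbiased sample variance) should dominate the inverse Fisher
# information matrix in the positive-semidefinite order.
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2, n, trials = 1.0, 2.0, 30, 100000

x = rng.normal(mu, np.sqrt(sigma2), size=(trials, n))
T = np.column_stack([x.mean(axis=1), x.var(axis=1, ddof=1)])  # unbiased for (mu, sigma^2)
cov_T = np.cov(T, rowvar=False)

I = np.diag([n / sigma2, n / (2 * sigma2**2)])                # Fisher information matrix
gap = cov_T - np.linalg.inv(I)
print("eigenvalues of cov(T) - I^{-1}:", np.linalg.eigvalsh(gap))
# Eigenvalues are nonnegative up to Monte Carlo error: the mean component
# attains its bound exactly, so its gap is zero only up to simulation noise.
```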


Regularity conditions

The bound relies on two weak regularity conditions on the probability density function f(x; \theta) and the estimator T(X):
* The Fisher information is always defined; equivalently, for all x such that f(x; \theta) > 0,

::\frac{\partial}{\partial\theta} \log f(x;\theta)

:exists and is finite.
* The operations of integration with respect to x and differentiation with respect to \theta can be interchanged in the expectation of T; that is,

::\frac{\partial}{\partial\theta} \left[ \int T(x) f(x;\theta) \,dx \right] = \int T(x) \left[ \frac{\partial}{\partial\theta} f(x;\theta) \right] \,dx

:whenever the right-hand side is finite.
:This condition can often be confirmed by using the fact that integration and differentiation can be swapped when either of the following cases holds:
:# The function f(x;\theta) has bounded support in x, and the bounds do not depend on \theta;
:# The function f(x;\theta) has infinite support, is continuously differentiable, and the integral converges uniformly for all \theta.


Proof


Proof for the general case based on the Chapman–Robbins bound

The Cramér–Rao bound can be obtained as a limiting case of the Chapman–Robbins bound. For a perturbation h of the parameter, the Chapman–Robbins bound states that

:\operatorname{var}(T) \geq \sup_{h} \frac{\left[\psi(\theta+h)-\psi(\theta)\right]^{2}}{\operatorname{E}_\theta\!\left[\left(\dfrac{f(X;\theta+h)}{f(X;\theta)}-1\right)^{2}\right]}.

Letting h \to 0 and using the regularity conditions above, the numerator behaves as [\psi'(\theta)h]^{2} and the denominator as h^{2} I(\theta), so the ratio converges to [\psi'(\theta)]^{2}/I(\theta), which is the Cramér–Rao bound.


A standalone proof for the general scalar case

Assume that T=t(X) is an estimator with expectation \psi(\theta) (based on the observations X), i.e. that \operatorname{E}(T) = \psi(\theta). The goal is to prove that, for all \theta,

:\operatorname{var}(t(X)) \geq \frac{[\psi'(\theta)]^{2}}{I(\theta)}.

Let X be a random variable with probability density function f(x; \theta). Here T = t(X) is a statistic, which is used as an estimator for \psi(\theta). Define V as the score:

:V = \frac{\partial}{\partial\theta} \ln f(X;\theta) = \frac{1}{f(X;\theta)}\frac{\partial}{\partial\theta}f(X;\theta)

where the chain rule is used in the final equality above. Then the expectation of V, written \operatorname{E}(V), is zero. This is because:

:\operatorname{E}(V) = \int f(x;\theta)\left[\frac{1}{f(x;\theta)}\frac{\partial}{\partial\theta} f(x;\theta)\right]\, dx = \frac{\partial}{\partial\theta}\int f(x;\theta) \, dx = 0

where the integral and partial derivative have been interchanged (justified by the second regularity condition). If we consider the covariance \operatorname{cov}(V, T) of V and T, we have \operatorname{cov}(V, T) = \operatorname{E}(V T), because \operatorname{E}(V) = 0. Expanding this expression we have

:\begin{align} \operatorname{cov}(V,T) & = \operatorname{E} \left( T \cdot\left[\frac{1}{f(X;\theta)}\frac{\partial}{\partial\theta}f(X;\theta) \right]\right) \\ & = \int t(x) \left[\frac{1}{f(x;\theta)} \frac{\partial}{\partial\theta} f(x;\theta) \right] f(x;\theta)\, dx \\ & = \frac{\partial}{\partial\theta} \left[ \int t(x) f(x;\theta)\,dx \right] = \frac{\partial}{\partial\theta} \operatorname{E}(T) = \psi'(\theta) \end{align}

again because the integration and differentiation operations commute (second condition). The Cauchy–Schwarz inequality shows that

:\sqrt{\operatorname{var}(T)\,\operatorname{var}(V)} \geq \left| \operatorname{cov}(V,T) \right| = \left| \psi'(\theta) \right|,

therefore

:\operatorname{var}(T) \geq \frac{[\psi'(\theta)]^{2}}{\operatorname{var}(V)} = \frac{[\psi'(\theta)]^{2}}{I(\theta)},

which proves the proposition.
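As an illustrative sanity check (not part of the proof), the two key facts used above, \operatorname{E}(V)=0 and \operatorname{cov}(V,T)=\psi'(\theta), can be verified numerically for a simple assumed model; the parameter values below are chosen only for the example.

```python
# Illustrative numerical check of the proof ingredients for X_1..X_n ~ N(theta, 1):
# the score V = sum(x_i - theta) has mean 0 and cov(V, T) = psi'(theta) = 1 for
# the unbiased estimator T = sample mean, so Cauchy–Schwarz gives
# var(T) >= 1 / var(V) = 1 / I(theta) = 1/n.
import numpy as np

rng = np.random.default_rng(3)
theta, n, trials = 0.5, 20, 200000

x = rng.normal(theta, 1.0, size=(trials, n))
V = (x - theta).sum(axis=1)        # score of the joint density at the true theta
T = x.mean(axis=1)                 # unbiased estimator, psi(theta) = theta

print("E[V]      ~", V.mean())                    # approximately 0
print("cov(V, T) ~", np.cov(V, T)[0, 1])          # approximately psi'(theta) = 1
print("var(T)    ~", T.var(), ">= 1/n =", 1 / n)  # the bound is attained here
```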


Examples


Multivariate normal distribution

For the case of a ''d''-variate normal distribution

:\boldsymbol{x} \sim N_d\left( \boldsymbol{\mu}(\boldsymbol{\theta}), \boldsymbol{C}(\boldsymbol{\theta}) \right)

the Fisher information matrix has elements

:I_{m,k} = \frac{\partial \boldsymbol{\mu}^{T}}{\partial \theta_m} \boldsymbol{C}^{-1} \frac{\partial \boldsymbol{\mu}}{\partial \theta_k} + \frac{1}{2} \operatorname{tr}\left( \boldsymbol{C}^{-1} \frac{\partial \boldsymbol{C}}{\partial \theta_m} \boldsymbol{C}^{-1} \frac{\partial \boldsymbol{C}}{\partial \theta_k} \right)

where "tr" is the trace. For example, let w = (w_1, \ldots, w_N) be a sample of N independent observations with unknown mean \theta and known variance \sigma^2:

:w \sim \mathbb{N}_N\left(\theta \boldsymbol{1}, \sigma^2 \boldsymbol{I} \right).

Then the Fisher information is a scalar given by

:I(\theta) = \left(\frac{\partial\boldsymbol{\mu}(\theta)}{\partial\theta}\right)^{T} \boldsymbol{C}^{-1} \left(\frac{\partial\boldsymbol{\mu}(\theta)}{\partial\theta}\right) = \sum^{N}_{i=1}\frac{1}{\sigma^2} = \frac{N}{\sigma^2},

and so the Cramér–Rao bound is

:\operatorname{var}\left(\hat{\theta}\right) \geq \frac{\sigma^2}{N}.
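The sketch below applies the same formula to a hypothetical linear-Gaussian model (the matrices H and C are assumptions chosen for illustration, not from the original article): when the mean is linear in the parameters and the covariance is known, the trace term vanishes and I(\boldsymbol{\theta}) = H^{T} C^{-1} H, a bound attained by generalized least squares.

```python
# Illustrative sketch (hypothetical model): x ~ N_2(H @ theta, C) with known C.
# The multivariate-normal formula gives I(theta) = H^T C^{-1} H, so the CRB for
# unbiased estimators of theta is (H^T C^{-1} H)^{-1}, attained by GLS.
import numpy as np

rng = np.random.default_rng(4)
H = np.array([[1.0, 0.5],
              [0.0, 2.0]])                 # Jacobian d mu / d theta (assumed)
C = np.array([[1.0, 0.3],
              [0.3, 0.5]])                 # known observation covariance (assumed)
theta = np.array([1.0, -2.0])
trials = 100000

Cinv = np.linalg.inv(C)
I = H.T @ Cinv @ H                         # Fisher information matrix
crb = np.linalg.inv(I)

x = rng.multivariate_normal(H @ theta, C, size=trials)
theta_hat = (crb @ H.T @ Cinv @ x.T).T     # generalized least squares, unbiased
print("empirical cov:\n", np.cov(theta_hat, rowvar=False))
print("CRB:\n", crb)                       # the two matrices agree closely
```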


Normal variance with known mean

Suppose ''X'' is a normally distributed random variable with known mean \mu and unknown variance \sigma^2. Consider the following statistic:

:T=\frac{\sum_{i=1}^n\left(X_i-\mu\right)^2}{n}.

Then ''T'' is unbiased for \sigma^2, as \operatorname{E}(T)=\sigma^2. What is the variance of ''T''?

:\operatorname{var}(T) = \operatorname{var}\left(\frac{\sum_{i=1}^n\left(X_i-\mu\right)^2}{n}\right) = \frac{\operatorname{var}\left\{(X-\mu)^2\right\}}{n} = \frac{1}{n}\left[\operatorname{E}\left\{(X-\mu)^4\right\}-\left(\operatorname{E}\left\{(X-\mu)^2\right\}\right)^2 \right]

(the second equality follows directly from the definition of variance). The first term is the fourth moment about the mean and has value 3(\sigma^2)^2; the second is the square of the variance, or (\sigma^2)^2. Thus

:\operatorname{var}(T)=\frac{2(\sigma^2)^2}{n}.

Now, what is the Fisher information in the sample? Recall that the score V is defined as

:V=\frac{\partial}{\partial\sigma^2}\log\left[ L(\sigma^2,X)\right]

where L is the likelihood function. Thus in this case,

:\log\left[L(\sigma^2,X)\right]=\log\left[\frac{1}{\sqrt{2\pi\sigma^2}}e^{-(X-\mu)^2/2\sigma^2}\right]=-\log(\sqrt{2\pi\sigma^2})-\frac{(X-\mu)^2}{2\sigma^2}

:V=\frac{\partial}{\partial\sigma^2}\log \left[L(\sigma^2,X) \right]=\frac{\partial}{\partial\sigma^2}\left[-\log(\sqrt{2\pi\sigma^2})-\frac{(X-\mu)^2}{2\sigma^2}\right]=-\frac{1}{2\sigma^2}+\frac{(X-\mu)^2}{2(\sigma^2)^2}

where the second equality is from elementary calculus. Thus, the information in a single observation is just minus the expectation of the derivative of V with respect to \sigma^2, or

:I=-\operatorname{E}\left(\frac{\partial V}{\partial\sigma^2}\right) =-\operatorname{E}\left(-\frac{(X-\mu)^2}{(\sigma^2)^3}+\frac{1}{2(\sigma^2)^2}\right) =\frac{\sigma^2}{(\sigma^2)^3}-\frac{1}{2(\sigma^2)^2} =\frac{1}{2(\sigma^2)^2}.

Thus the information in a sample of n independent observations is just n times this, or \frac{n}{2(\sigma^2)^2}. The Cramér–Rao bound states that

:\operatorname{var}(T)\geq\frac{2(\sigma^2)^2}{n}.

In this case, the inequality is saturated (equality is achieved), showing that the estimator is efficient. However, we can achieve a lower mean squared error using a biased estimator. The estimator

:T=\frac{\sum_{i=1}^n\left(X_i-\mu\right)^2}{n+2}

obviously has a smaller variance, which is in fact

:\operatorname{var}(T)=\frac{2n(\sigma^2)^2}{(n+2)^2}.

Its bias is

:\left(1-\frac{n}{n+2}\right)\sigma^2=\frac{2\sigma^2}{n+2}

so its mean squared error is

:\operatorname{MSE}(T)=\left(\frac{2n}{(n+2)^2}+\frac{4}{(n+2)^2}\right)(\sigma^2)^2 =\frac{2(\sigma^2)^2}{n+2},

which is clearly less than what unbiased estimators can achieve according to the Cramér–Rao bound. When the mean is not known, the minimum mean squared error estimate of the variance of a sample from a Gaussian distribution is achieved by dividing by n+1, rather than n-1 or n+2.
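The closing claim can be checked numerically. The following is an illustrative sketch (assumptions: Gaussian data with the mean unknown and estimated by the sample mean) comparing the mean squared error of S/c for S = \sum_i (X_i - \bar{X})^2 and several divisors c; the minimum occurs at c = n+1, and the biased choices fall below the unbiased bound 2\sigma^4/n.

```python
# Illustrative sketch: MSE of S/c for S = sum((x - xbar)^2) and divisors
# c in {n-1, n, n+1, n+2}, Gaussian data with unknown mean. The text's claim is
# that c = n+1 minimizes the MSE and that biased divisors beat 2*sigma^4/n.
import numpy as np

rng = np.random.default_rng(5)
mu, sigma2, n, trials = 0.0, 1.0, 10, 500000

x = rng.normal(mu, np.sqrt(sigma2), size=(trials, n))
S = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

for c in (n - 1, n, n + 1, n + 2):
    mse = ((S / c - sigma2) ** 2).mean()
    print(f"divisor {c}: MSE ~ {mse:.5f}")
print("unbiased CRB on the variance:", 2 * sigma2**2 / n)
# Expected pattern: the minimum MSE is at c = n+1, and the divisors n, n+1, n+2
# all give an MSE below the unbiased bound 2*sigma^4/n = 0.2 for these settings.
```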


See also

* Chapman–Robbins bound
* Kullback's inequality
* Brascamp–Lieb inequality




External links


* FandPLimitTool – a GUI-based software tool for calculating the Fisher information and the Cramér–Rao lower bound, with applications to single-molecule microscopy.